1 Introduction

The majority of the following analyses have been inspired by the pipeline used in the article Rossi et al. Cell Stem Cell 2020. The code associated with this publication is available on GitHub in the Cardiac_Gastruloids repository.

The goal is to describe and analyze the scRNA-seq data produced in our lab and to compare them with the data retrieved from the Rossi et al. article. In the rest of the document, the data used in the Rossi et al. article will be named “reference data”, and the data produced in our lab will be named “MMG lab data” (for Marseille Medical Genetics). The reference data were recovered from the GEO website under the accession number GEO: GSE158999. The accessed files are:

  • barcodes.tsv.gz,

  • features.tsv.gz,

  • matrix.mtx.gz,

  • metadata.tsv.gz.

The Rossi et al. article also refers to and uses the in vivo atlas from Pijuan-Sala, Griffiths, Gottgens et al. Nature 2019: in Rossi et al., a classifier was trained on the atlas data. I directly downloaded this classifier from the above-mentioned GitHub repository.

I also downloaded the data of the atlas of Pijuan-Sala et al. using a terminal with the following command lines:

curl https://content.cruk.cam.ac.uk/jmlab/atlas_data.tar.gz > atlas_data.tar.gz
tar -zxvf atlas_data.tar.gz

2 General settings

2.1 Directory management

To reproduce the analyses, it is recommended to set up all the directories described below. I assume that all the input files are stored in the same directory, inputData. The inputData directory is divided into 5 subdirectories:

  • 2 of them are dedicated to the raw data to be analyzed, from the Marseille Medical Genetics (MMG) laboratory and from the reference dataset (lab_data and ref_data, respectively);

  • 1 directory contains the data needed to create an atlas object from Pijuan-Sala et al. (atlas);

  • 1 directory stores the classifier based on the atlas of Pijuan-Sala et al. and provided by Rossi et al. (scmap_pijuansala_classifier);

  • 1 directory contains the input tables provided by Rossi et al. (InputTables). These tables are unrelated to the GEO files listed above: they are tsv files used, for instance, to name the clusters, to assign the same colors throughout the analyses, or to list the genes of interest used in the Rossi et al. analysis.

basePath <- "XXXXX"  # to be seen as the root for any analysis based on a set of input data
# input paths
atlas.directory <- paste0(basePath, "inputData/atlas/")
classifier.folder <- paste0(basePath, "inputData/scmap_pijuansala_classifier/")
inputTables.folder <- paste0(basePath, "inputData/InputTables/")

# output paths
baseAnalysis <- paste0(basePath, "XXXXX")
rdsObjects <- paste0(baseAnalysis, "rdsObjects/")
atlas.folder <- paste0(baseAnalysis, "atlas/")
fig.folder <- paste0(baseAnalysis, "figures/")

# create the output directories when they do not exist yet
# (baseAnalysis comes first so that its subdirectories can then be created)
for (d in c(baseAnalysis, rdsObjects, atlas.folder, fig.folder)) {
    if (!dir.exists(d)) {
        dir.create(d)
    }
}
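As a more compact alternative, dir.create() with recursive = TRUE creates the missing parent directories in one call. A minimal sketch, using a hypothetical temporary root rather than the actual analysis paths:

```r
# hypothetical root, used for illustration only (the real analysis uses baseAnalysis)
base <- file.path(tempdir(), "demoAnalysis")

# recursive = TRUE creates missing parents in one call,
# showWarnings = FALSE makes the call a no-op when the folder already exists
for (d in file.path(base, c("rdsObjects", "atlas", "figures"))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

dir.exists(file.path(base, c("rdsObjects", "atlas", "figures")))
```

This variant is also safe to re-run, since existing directories are silently skipped.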

2.2 Load custom color tables

To keep a consistent coloration on the plots throughout the analyses, I set up named vectors of colors.

colors.table <- read.table(file = paste0(inputTables.folder, "ClusterColors.tsv"), sep = "\t", header = T, 
    comment.char = "", as.is = T)
colors.use_ab.initio <- setNames(colors.table$blind_friendly[!is.na(colors.table$ab_initio_identity)], colors.table$ab_initio_identity[!is.na(colors.table$ab_initio_identity)])
colors.use_transferred <- setNames(colors.table$blind_friendly[!is.na(colors.table$transferred_identity)], 
    colors.table$transferred_identity[!is.na(colors.table$transferred_identity)])
colors.use_stages <- setNames(c("#be465c", "#b04b3d", "#c86633", "#d0a63f", "#998138", "#90b03d", "#60a756", 
    "#45c097", "#5e8bd5", "#6d71d8", "#573585", "#bd80d5", "#b853a2", "#ba4b7d"), c("Day_04", "Day_05", "Day_06", 
    "Day_07", "Day_10", "E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5"))

2.3 Parallelization settings

Some functions can work with parallelization. Here I configure how the parallelization should behave.

# Parallelization
plan(strategy = "multisession", workers = 4)
# plan(batchtools_slurm, template =
# '/shared/projects/mothard_in_silico_modeling/slurm_scripts/aa_seuratAn.tmpl')
options(future.globals.maxSize = 62914560000)  #60GB
options(future.seed = 26)

3 Reference data

3.1 Load the reference dataset

The Read10X function only needs the directory in which the input files are stored. It builds a count matrix by interpreting the barcodes, features and matrix files.
The CreateSeuratObject function then initializes a Seurat object from these counts.

ref.dataset.folder <- paste0(basePath, "inputData/ref_data/")

ref.raw.data <- Read10X(data.dir = ref.dataset.folder, gene = 1)
ref.4d.SO <- CreateSeuratObject(counts = ref.raw.data, project = "recoverAnalysis")  #, assay='RNA', min.cells=3, min.features=100)

# Dimensions of the object
head(ref.4d.SO)
dim(ref.4d.SO)
## An object of class Seurat 
## 1 features across 30496 samples within 1 assay 
## Active assay: RNA (1 features)
## [1] 23961 30496

The reference dataset contains 30496 cells, in which 23961 features (genes) were detected.

3.2 Add metadata

The metadata file was created during the Rossi et al. analysis and is provided to guide the users. I add the stage and Replicate metadata to the Seurat object, under the day and replicate column names, respectively.

ref.md <- read.table(file = paste0(ref.dataset.folder, "metadata.tsv.gz"), sep = "\t", header = TRUE, stringsAsFactors = F)
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$stage, col.name = "day")
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$Replicate, col.name = "replicate")

Here is all the metadata information contained in the Seurat object of the reference dataset: str(ref.4d.SO@meta.data)

## 'data.frame':    30496 obs. of  5 variables:
##  $ orig.ident  : Factor w/ 1 level "GAS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nCount_RNA  : num  53183 41938 43631 69730 64691 ...
##  $ nFeature_RNA: int  6676 5225 5869 7090 7039 5568 6208 5684 3450 6052 ...
##  $ day         : chr  "Day4" "Day4" "Day4" "Day4" ...
##  $ replicate   : int  0 0 0 0 0 0 0 0 0 0 ...

There are 5 metadata columns stored in a data frame, each of them describing the cells. Besides the metadata imported from the external file (day and replicate) and orig.ident (the original identity, set when the object was created), there are the nCount_RNA and nFeature_RNA columns:

  • nCount_RNA is the number of reads counted in every droplet with a gel bead associated to a cell;

  • nFeature_RNA is the number of different features (genes) detected in every droplet.
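Both quantities can be recomputed from a raw count matrix. A minimal sketch on a toy matrix with hypothetical gene and cell names:

```r
# toy count matrix: 3 features (rows) x 2 cells (columns), hypothetical names
counts <- matrix(c(5, 95, 0, 10, 190, 3),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))

nCount   <- colSums(counts)      # total reads per cell: cell1 = 100, cell2 = 203
nFeature <- colSums(counts > 0)  # detected features per cell: cell1 = 2, cell2 = 3
```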

We can get the different values a metadata column can take with the following command. Here is an example for the metadata named day: unique(ref.4d.SO@meta.data$day)

## [1] "Day4" "Day5" "Day6" "Day7"

3.2.1 Homogenize labels (day, dataset)

For the sake of the analysis, I need to manipulate the labels to make them coherent across all the datasets (reference, MMG lab and atlas data). In the reference dataset, the day metadata was loaded in the previous part. However, I edit it so that the labels sort properly (Day_05 before Day_11, and not the reverse).

# modify day label
ref.4d.SO@meta.data$day <- gsub("^(Day)([0-9]*)$", "\\1_0\\2", ref.4d.SO@meta.data$day)

# create new metadata named dataset
ref.4d.SO$dataset <- paste0("ref_", tolower(ref.4d.SO@meta.data$day))
Idents(ref.4d.SO) <- ref.4d.SO$dataset

ref_dataset_names <- unique(ref.4d.SO$dataset)

A new metadata column named dataset has been created. The values it can take are ref_day_04, ref_day_05, ref_day_06 and ref_day_07, depending on the origin and the stage (day) of the sequenced cells.
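The gsub() relabeling used above can be checked in isolation. A quick sketch:

```r
days <- c("Day4", "Day5", "Day6", "Day7")

# capture "Day" and the digits separately, then insert "_0" between the two groups
gsub("^(Day)([0-9]*)$", "\\1_0\\2", days)
# gives "Day_04" "Day_05" "Day_06" "Day_07"
```

Note that this zero-padding assumes single-digit day numbers, which is the case for the reference data (days 4 to 7).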

Now, here’s what the metadata looks like:

orig.ident nCount_RNA nFeature_RNA day replicate dataset
GAS_Day4_batch4_AAACCCATCGACTCCT GAS 53183 6676 Day_04 0 ref_day_04
GAS_Day4_batch4_AAACGAAGTCATAACC GAS 41938 5225 Day_04 0 ref_day_04
GAS_Day4_batch4_AAACGCTCAAGCGAGT GAS 43631 5869 Day_04 0 ref_day_04
GAS_Day4_batch4_AAACGCTCACCCTAAA GAS 69730 7090 Day_04 0 ref_day_04
GAS_Day4_batch4_AAAGAACAGCCTTGAT GAS 64691 7039 Day_04 0 ref_day_04
GAS_Day4_batch4_AAAGAACAGTTTCGAC GAS 39653 5568 Day_04 0 ref_day_04

3.3 Split into sub-datasets

For the following parts of the analysis, I will split the dataset ref.4d.SO into 4 sub-datasets. I use the SplitObject function from the Seurat package (Stuart, Butler et al. Cell 2019). By providing an attribute (a metadata column name), the object is split into a list of subsetted objects.

I split the dataset according to the day the gastruloids were sequenced: the sequencing was performed separately for each day, so the days are the natural entities of the analysis. Before the sub-datasets are merged, it is recommended to perform a quality control on each entity (see the next part).
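Conceptually, SplitObject() behaves like base R's split() applied to the cells. A toy illustration of the splitting logic on a plain data frame, with hypothetical values:

```r
# toy metadata: 5 cells spread over 2 dataset values (hypothetical)
md <- data.frame(cell = paste0("c", 1:5),
                 dataset = c("day_04", "day_04", "day_05", "day_05", "day_05"))

# one sub-data.frame per dataset value, returned as a named list
md.list <- split(md, md$dataset)
names(md.list)        # "day_04" "day_05"
nrow(md.list$day_05)  # 3 cells were assigned to day_05
```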

# Split the object
refSO.list <- SplitObject(ref.4d.SO, split.by = "dataset")

# set up the project name of each sub-dataset
for (SO in refSO.list) {
    projName <- unique(SO@meta.data$dataset)
    refSO.list[[projName]]@project.name <- projName
}

Size of the entire dataset:

Dataset Nbr of features Nbr of cells
ref_4_days 23961 30496

Sizes of the sub-datasets:

Dataset Nbr of features Nbr of cells
ref_day_04 23961 6898
ref_day_05 23961 7783
ref_day_06 23961 8313
ref_day_07 23961 7502

4 MMG laboratory data

The MMG lab acquired the scRNA-seq data day by day. The sequencing saturation was not good enough for day 5; as a consequence, this day was resequenced. There are hence five datasets: one for each of days 4, 6 and 11, and two for day 5.

For each dataset, I will execute the following steps:

  • load and interpret raw data (barcodes, features and matrix files) with Read10X function and create a seurat object (CreateSeuratObject function),

  • add metadata.

4.1 Load the 5 sub-datasets

There are five sub-datasets to load. I create a Seurat object for each of them and store them all in a list. The project name of each sub-dataset is derived from its folder name.

# get path to the folders of each sub-dataset
lab_dirs <- list.files(paste0(basePath, "inputData/lab_data"), full.names = TRUE)
print(lab_dirs)

# create a list of seurat objects
labSO.list <- list()
lab_dataset_names <- c()
for (x in lab_dirs) {
    dataset.name <- str_extract(x, "[a-z0-9_]*$")
    lab_dataset_names <- append(lab_dataset_names, dataset.name)

    raw.data <- Read10X(data.dir = x, gene = 2)
    SO <- CreateSeuratObject(counts = raw.data, project = dataset.name)  #, assay='RNA', min.cells=3, min.features=100)

    labSO.list[[dataset.name]] <- SO
}

labSO.list
## [1] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_04"   
## [2] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_05"   
## [3] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_05bis"
## [4] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_06"   
## [5] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_11"   
## $lab_day_04
## An object of class Seurat 
## 32286 features across 7324 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_05
## An object of class Seurat 
## 32286 features across 7794 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_05bis
## An object of class Seurat 
## 32286 features across 7729 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_06
## An object of class Seurat 
## 32286 features across 6628 samples within 1 assay 
## Active assay: RNA (32286 features)
## 
## $lab_day_11
## An object of class Seurat 
## 32287 features across 9584 samples within 1 assay 
## Active assay: RNA (32287 features)

Similarly to the project names of the reference sub-datasets, the project names of the MMG lab sub-datasets take their values among: lab_day_04, lab_day_05, lab_day_05bis, lab_day_06, lab_day_11.

4.2 Add metadata

Similarly to the reference dataset, I need to add metadata to the MMG lab dataset: the day the gastruloids were sequenced, the dataset name (built in the same way as the reference dataset names), and the replicate.

The replicate metadata is required because two datasets were sequenced for day 5; it will be used to differentiate the cells of these two datasets.

labSO.list <- lapply(labSO.list, function(SO) {
    dataset <- SO@project.name
    day <- str_to_title(str_extract(dataset, "day_[0-9]*$"))

    SO <- AddMetaData(object = SO, metadata = day, col.name = "day")
    SO <- AddMetaData(object = SO, metadata = dataset, col.name = "dataset")

    if (dataset == "lab_day_05bis") {
        SO <- AddMetaData(SO, metadata = 1, col.name = "replicate")
    } else {
        SO <- AddMetaData(SO, metadata = 0, col.name = "replicate")
    }
    return(SO)
})

data.frame(head(labSO.list[[1]]@meta.data)) %>%
    knitr::kable(align = "lrrrrrr") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
orig.ident nCount_RNA nFeature_RNA day dataset replicate
AAACCCAAGTACAGAT lab_day_04 59952 6798 Day_04 lab_day_04 0
AAACCCAAGTTCATCG lab_day_04 38805 5932 Day_04 lab_day_04 0
AAACCCACAAGCGAAC lab_day_04 45520 6008 Day_04 lab_day_04 0
AAACCCACAAGGTTGG lab_day_04 26327 4497 Day_04 lab_day_04 0
AAACCCACACATATGC lab_day_04 28763 4954 Day_04 lab_day_04 0
AAACCCACACTAAACC lab_day_04 27481 5004 Day_04 lab_day_04 0

5 Nomenclature for cell names

The cell names follow a different nomenclature in the reference and in the MMG lab sub-datasets. On the one hand, the reference cell names are already custom names combining the biological origin (GAS for gastruloids), the day the gastruloids were sequenced, the batch number and the cell barcode. On the other hand, the MMG lab cell names consist only of the cell barcodes (these barcodes are called UMIs in the code below).

10x Genomics draws the cell barcodes from a fixed whitelist, so the number of unique barcodes is limited. Hence, working with data obtained from multiple sequencing experiments might lead to an overlap of barcodes between datasets.

I hence decided to create a nomenclature to rename all the cells with a unique identifier. Each cell will be identified by:

  • ref or lab, depending on whether the data come from Rossi et al. or the MMG laboratory,

  • the dataset value previously added to the metadata,

  • the replicate number, also present in the metadata, and

  • the cell barcode provided by 10x Genomics.

rename_cells <- function(SO) {
    cell_ids <- colnames(SO)
    UMIs <- str_extract(cell_ids, "[A-Z]*$")
    cellnames <- paste(SO$dataset, SO$replicate, UMIs, sep = "_")

    SO <- RenameCells(SO, new.names = cellnames)
    return(SO)
}

refSO.list <- lapply(refSO.list, rename_cells)
labSO.list <- lapply(labSO.list, rename_cells)
print(colnames(refSO.list$ref_day_04)[1])
print(colnames(labSO.list$lab_day_04)[1])
## [1] "ref_day_04_0_AAACCCATCGACTCCT"
## [1] "lab_day_04_0_AAACCCAAGTACAGAT"
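The trailing-sequence extraction done with str_extract() in rename_cells() has a base-R equivalent. A minimal sketch with hypothetical cell names, useful when stringr is not available:

```r
# extract the trailing barcode: greedily remove everything up to the last "_"
cell_ids <- c("GAS_Day4_batch4_AAACCCATCGACTCCT", "AAACCCAAGTACAGAT")
barcodes <- sub("^.*_", "", cell_ids)
# names without "_" (MMG lab style) are returned unchanged

# rebuild unique identifiers in the same spirit as rename_cells()
new_ids <- paste("ref_day_04", 0, barcodes, sep = "_")
```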

We can check that there are as many unique cell identifiers as there are cells among all the sub-datasets.

nbCells <- Reduce("+", lapply(refSO.list, function(SO) dim(SO)[2])) + Reduce("+", lapply(labSO.list, function(SO) dim(SO)[2]))

nbIDs <- Reduce("+", lapply(refSO.list, function(SO) length(unique(colnames(SO))))) + Reduce("+", lapply(labSO.list, 
    function(SO) length(unique(colnames(SO)))))

print(paste("Nbr of cells among all the sub-datasets:", nbCells))
print(paste("Nbr of cell identifiers:", nbIDs))
## [1] "Nbr of cells among all the sub-datasets: 69555"
## [1] "Nbr of cell identifiers: 69555"

6 Gather all the sub-datasets

Now, all the sub-datasets, from both the reference and the MMG lab data, are prepared for the analysis. First, I will control the quality of the cells; then I will apply the standard preprocessing workflow of Seurat. These preprocessing steps are required before removing the doublets with the DoubletFinder library (McGinnis et al. Cell Systems 2019). Finally, I will investigate whether there are sources of unwanted variation among the replicates.

First and foremost, I gather all the data into a single list. The computational advantage lies in the use of the lapply() function, which applies a given function to every member of a list in one command.
All the sub-datasets (reference and MMG lab data) will be analyzed with the same pipeline.

allSO.list <- do.call(c, list(refSO.list, labSO.list))
# allSO.list <- append(refSO.list, labSO.list) # Equivalent allSO.list <- c(refSO.list, labSO.list) #
# Equivalent
SO.names <- names(allSO.list)
rm(refSO.list, labSO.list)
gc(verbose = FALSE)
str(allSO.list, max.level = 2)
##             used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells   6380552  340.8   11687185   624.2    8825061   471.4
## Vcells 990866865 7559.8 1475152559 11254.6 1466330759 11187.3
## List of 9
##  $ ref_day_04   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ ref_day_05   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ ref_day_06   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ ref_day_07   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_04   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_05   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_05bis:Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_06   :Formal class 'Seurat' [package "Seurat"] with 12 slots
##  $ lab_day_11   :Formal class 'Seurat' [package "Seurat"] with 12 slots

All the sub-dataset Seurat objects are now stored in a named list, as visible in the str() output above: the names of the list elements are the project.name of each sub-dataset.

As the number of cells will decrease during the following steps, I will print a table with the number of cells, as well as the number of detected features, in each sub-dataset. A barplot will also be displayed to follow the evolution of each sub-dataset along the analysis.

Size of all the sub-datasets:

size_suball <- data.frame(cbind(sapply(allSO.list, function(SO) {
    if (str_detect(SO@project.name, "^ref")) {
        "ref"
    } else {
        "lab"
    }
}), sapply(allSO.list, function(SO) {
    if (str_detect(SO@project.name, "04")) {
        "4"
    } else if (str_detect(SO@project.name, "05bis")) {
        "5bis"
    } else if (str_detect(SO@project.name, "05")) {
        "5"
    } else if (str_detect(SO@project.name, "06")) {
        "6"
    } else if (str_detect(SO@project.name, "07")) {
        "7"
    } else if (str_detect(SO@project.name, "11")) {
        "11"
    }
}), t(sapply(allSO.list, dim)), "Preliminary Counts", "01"))
colnames(size_suball) <- c("origin", "day", "Nbr_of_features", "Nbr_of_cells", "Analysis_step_name", "Step_number")
size_suball$Nbr_of_cells <- as.integer(as.character(size_suball$Nbr_of_cells))
size_suball$Nbr_of_features <- as.integer(as.character(size_suball$Nbr_of_features))
size_suball$day <- factor(size_suball$day, levels = c("4", "5", "5bis", "6", "7", "11"), ordered = TRUE)
size_suball %>%
    knitr::kable(align = "lrrrrlr") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
origin day Nbr_of_features Nbr_of_cells Analysis_step_name Step_number
ref_day_04 ref 4 23961 6898 Preliminary Counts 01
ref_day_05 ref 5 23961 7783 Preliminary Counts 01
ref_day_06 ref 6 23961 8313 Preliminary Counts 01
ref_day_07 ref 7 23961 7502 Preliminary Counts 01
lab_day_04 lab 4 32286 7324 Preliminary Counts 01
lab_day_05 lab 5 32286 7794 Preliminary Counts 01
lab_day_05bis lab 5bis 32286 7729 Preliminary Counts 01
lab_day_06 lab 6 32286 6628 Preliminary Counts 01
lab_day_11 lab 11 32287 9584 Preliminary Counts 01
ggplot(size_suball, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = origin, fill = origin), position = position_dodge(), 
    width = 0.7) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + ggtitle("Number of cells in each sub-dataset") + 
    facet_grid(. ~ origin, scales = "free_x", space = "free")

7 Quality control

The cells can be filtered out based on multiple criteria, such as:

  • the number of features detected in a cell transcriptome,

  • the number of counts for each single cell, or

  • the percentage of mitochondrial counts.

The feature plots of the nCount_RNA and the percent.mt metadata will guide me regarding the thresholds to apply for filtering out the low-quality cells.

7.1 Percentage of mitochondrial reads & cutoffs

To determine the percentage of mitochondrial counts in each cell, I rely on the fact that mitochondrial gene names always begin with “mt-”. The PercentageFeatureSet function calculates, for each cell, the percentage of the counts assigned to the features matching a given pattern; here, I use the “^mt-” pattern.

allSO.list <- future_lapply(allSO.list, function(SO) {
    SO[["percent.mt"]] <- PercentageFeatureSet(SO, pattern = "^mt-")
    return(SO)
}, future.seed = 26)
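What PercentageFeatureSet() computes can be reproduced by hand. A sketch on a toy count matrix with hypothetical gene names:

```r
# toy counts: rows are features, columns are cells (hypothetical names)
counts <- matrix(c(5, 95, 10, 190),
                 nrow = 2,
                 dimnames = list(c("mt-Nd1", "Actb"), c("cell1", "cell2")))

mito <- grepl("^mt-", rownames(counts))  # flag the mitochondrial features
percent.mt <- colSums(counts[mito, , drop = FALSE]) / colSums(counts) * 100
# cell1: 5 / 100 * 100 = 5% ; cell2: 10 / 200 * 100 = 5%
```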

The percentage of mitochondrial counts (or reads) is stored for each cell as a metadata under the name of percent.mt (see the example on the data of the lab, day 4).

orig.ident nCount_RNA nFeature_RNA day dataset replicate percent.mt
lab_day_04_0_AAACCCAAGTACAGAT lab_day_04 59952 6798 Day_04 lab_day_04 0 2.340206
lab_day_04_0_AAACCCAAGTTCATCG lab_day_04 38805 5932 Day_04 lab_day_04 0 1.154490
lab_day_04_0_AAACCCACAAGCGAAC lab_day_04 45520 6008 Day_04 lab_day_04 0 1.001758
lab_day_04_0_AAACCCACAAGGTTGG lab_day_04 26327 4497 Day_04 lab_day_04 0 1.762449
lab_day_04_0_AAACCCACACATATGC lab_day_04 28763 4954 Day_04 lab_day_04 0 2.718771
lab_day_04_0_AAACCCACACTAAACC lab_day_04 27481 5004 Day_04 lab_day_04 0 3.114879
lapply(allSO.list, function(SO) {
    VlnPlot(SO, features = c("percent.mt"), ncol = 3) + geom_hline(aes(yintercept = 10), color = "blue", size = 1.5) + 
        geom_hline(aes(yintercept = 1.5), color = "orange", size = 1.5)
})

The reference sub-datasets show a mitochondrial content that never exceeds 15% of the transcripts of a cell. In addition, their cells never have less than 1.5% of mitochondrial content, which is why there is no orange line on the first 4 plots (the plots of the reference sub-datasets). The data were probably already filtered: the code published on GitHub by Rossi et al. contains such filters, but they have no effect on the published data, suggesting that the low-quality cells were removed before publication. In contrast, some cells of the MMG lab data have a mitochondrial content close to 80%.

I will keep the cells whose percentage of mitochondrial counts is between 1.5% and 10% (orange and blue lines, respectively).

Information coming from 10X Genomics

Apoptotic cells express mitochondrial genes and export these transcripts to the cytoplasm in mammalian cells. Lysed cells with intact mitochondria may be partitioned into GEMs, increasing the fraction of mitochondrial transcripts detected.

It is also mentioned in Luecken MD and Theis FJ Mol Syst Biol. 2019 that a high mitochondrial content may represent doublets.

7.2 Counts’ thresholds

Rossi et al. filtered out the cells with fewer than 2000 read counts. I use a slightly stricter lower threshold of 3000 read counts (green line). In addition to this threshold, I will also remove the cells with more than 150,000 read counts (purple line).

lapply(allSO.list, function(SO) {
    VlnPlot(SO, features = c("nCount_RNA"), ncol = 3) + geom_hline(aes(yintercept = 150000), color = "purple", 
        size = 1.5) + geom_hline(aes(yintercept = 3000), color = "green", size = 1.5)
})

The reference sub-datasets never have less than 3000 read counts, which is why there is no green line on the first 4 plots (the plots of the reference sub-datasets). Actually, it appears that they never have less than 5000 read counts.

Furthermore, the higher threshold filters out some outliers in both the reference and the MMG lab sub-datasets. The two MMG lab sub-datasets of day 5 never reach 150,000 read counts, which is why the purple line is absent from the corresponding plots.

7.3 Minimum feature diversity

The number of features per cell is also assessed for quality control.

lapply(allSO.list, function(SO) {
    VlnPlot(SO, features = c("nFeature_RNA"), ncol = 3) + geom_hline(aes(yintercept = 1800), color = "#56B4E9", 
        size = 1.5)
})

The reference sub-datasets never have less than 2000 features per cell, while the MMG lab sub-datasets range from 38 to 10064 features per cell. Many cells of the MMG lab sub-datasets show a low number of detected features. Thus, a minimum of 1800 detected features per cell will be required (blue line).

7.4 Isolated features

Among the thousands of detected features, some of them are expressed in only a handful of cells (around 10 to 15 cells out of several thousands). Such features provide little information on the cell heterogeneity and could as well reflect contamination. It is also important to notice that filtering out the features expressed in less than 50 cells (red dashed line) would make it difficult to detect clusters gathering less than 50 cells.

# Create dataframe
countcells.df.list <- lapply(allSO.list, function(SO) {
    df <- data.frame(rowSums(SO@assays$RNA@counts != 0))
    df$features <- rownames(df)
    colnames(df) <- c("Nbr_of_cells", "features")
    rownames(df) <- NULL
    return(df)
})

lapply(1:length(countcells.df.list), function(i) {
    project.name <- names(countcells.df.list)[i]
    df <- countcells.df.list[[i]]
    # Plot gene expression repartition
    ggplot(df, aes(x = Nbr_of_cells)) + geom_histogram() + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + 
        ylab("frequency") + scale_x_continuous(trans = "log10") + expand_limits(x = c(0, 10500), y = c(0, 
        2000)) + geom_vline(aes(xintercept = 8), color = "green", size = 1) + geom_vline(aes(xintercept = 50), 
        color = "red", size = 1, linetype = "dashed") + ggtitle(paste0("Repartition of expressed features\n", 
        project.name))
})

To limit the number of dropouts, while still being able to identify small clusters, the features expressed in less than 8 cells will be removed.
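The isolated-feature filter can be sketched on a toy matrix, using a toy threshold of 2 cells instead of 8 and hypothetical names:

```r
# toy counts: 3 features x 3 cells (hypothetical names)
counts <- matrix(c(1, 0, 4, 2, 0, 5, 3, 1, 0),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2", "cell3")))

cells_per_feature <- rowSums(counts != 0)  # geneA: 3, geneB: 1, geneC: 2
keep <- names(cells_per_feature)[cells_per_feature >= 2]
counts.filtered <- counts[keep, ]          # geneB, seen in a single cell, is dropped
```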

7.5 Apply QC

Considering all the quality control criteria together, I confirm the previously chosen thresholds. The low-quality cells will be removed according to these thresholds.


lapply(allSO.list, function(SO) {
    ggplot(SO@meta.data, aes(nCount_RNA, nFeature_RNA, colour = percent.mt)) + geom_point() + lims(colour = c(0, 
        100)) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + ylab("nFeature_RNA") + geom_hline(aes(yintercept = 1800), 
        color = "#56B4E9", size = 1) + geom_vline(aes(xintercept = 150000), color = "purple", size = 1) + 
        geom_vline(aes(xintercept = 3000), color = "green", size = 1) + ggtitle(paste0("Nb of features as function of Nb of counts\n", 
        SO@project.name))
})

The cells with a high content of mitochondrial genes lie below the blue line (minimum of 1800 features) and at the left of the green line (minimum of 3000 read counts). Such cells, with a low count depth (counts per cell), few detected features and a high mitochondrial content, are indicative of cells whose cytoplasmic mRNA leaked out through a broken membrane, so that only the mRNA located in the mitochondria was conserved (Luecken MD and Theis FJ Mol Syst Biol. 2019).

In this section, I apply all the cutoffs determined in the previous steps. As a result, the sub-datasets will have either less features or less cells.

allSO.list <- lapply(1:length(allSO.list), function(i) {
    df <- countcells.df.list[[i]]
    SO <- allSO.list[[i]]
    SO <- subset(SO, features = which(df$Nbr_of_cells > 8), subset = percent.mt > 1.5 & percent.mt < 10 & 
        nCount_RNA > 3000 & nCount_RNA < 150000 & nFeature_RNA > 1800)
    return(SO)
})
names(allSO.list) <- c(ref_dataset_names, lab_dataset_names)

Size of all the sub-datasets after filtering:

size_suball2 <- data.frame(cbind(
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
    }),
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "04")) { "4" }
        else if (str_detect(SO@project.name, "05bis")) { "5bis" }
        else if (str_detect(SO@project.name, "05")) { "5" }
        else if (str_detect(SO@project.name, "06")) { "6" }
        else if (str_detect(SO@project.name, "07")) { "7" }
        else if (str_detect(SO@project.name, "11")) { "11" }
    }),
    
    t(sapply(allSO.list, dim)),
    "QC_filtering",
    "02"
))
colnames(size_suball2) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball2$Nbr_of_cells <- as.integer(as.character(size_suball2$Nbr_of_cells))
size_suball2$Nbr_of_features <- as.integer(as.character(size_suball2$Nbr_of_features))
size_suball2$day <- factor(size_suball2$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball, size_suball2)
size_suball_track %>%
  knitr::kable(align = "lrrrrlr") %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  scroll_box(width = "100%", height = "200px")
rm(size_suball, size_suball2)
origin day Nbr_of_features Nbr_of_cells Analysis_step_name Step_number
ref_day_04 ref 4 23961 6898 Preliminary Counts 01
ref_day_05 ref 5 23961 7783 Preliminary Counts 01
ref_day_06 ref 6 23961 8313 Preliminary Counts 01
ref_day_07 ref 7 23961 7502 Preliminary Counts 01
lab_day_04 lab 4 32286 7324 Preliminary Counts 01
lab_day_05 lab 5 32286 7794 Preliminary Counts 01
lab_day_05bis lab 5bis 32286 7729 Preliminary Counts 01
lab_day_06 lab 6 32286 6628 Preliminary Counts 01
lab_day_11 lab 11 32287 9584 Preliminary Counts 01
ref_day_041 ref 4 17928 6726 QC_filtering 02
ref_day_051 ref 5 18282 7637 QC_filtering 02
ref_day_061 ref 6 18247 8133 QC_filtering 02
ref_day_071 ref 7 18279 7385 QC_filtering 02
lab_day_041 lab 4 17679 5518 QC_filtering 02
lab_day_051 lab 5 18786 6728 QC_filtering 02
lab_day_05bis1 lab 5bis 18582 6689 QC_filtering 02
lab_day_061 lab 6 17927 5196 QC_filtering 02
lab_day_111 lab 11 18354 6398 QC_filtering 02
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name), 
    position = position_dodge(), width = 0.6) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + 
    ggtitle("Number of cells in each sub-dataset\nafter QC") + facet_grid(. ~ origin, scales = "free_x", space = "free")

Note that the Rossi et al. data deposited on GEO were already preprocessed, which explains why the QC filtering removes proportionally fewer cells from the reference sub-datasets than from the MMG lab ones.

8 Seurat Standard Preprocessing Workflow

To compare gene expression across cells, the data have to be normalized. I use the default parameters of the NormalizeData function, i.e. the LogNormalize method: the feature counts of each cell are divided by the total counts of that cell and multiplied by the scale.factor (10,000 by default), and the result is natural-log transformed using log1p.
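The LogNormalize transformation can be sketched by hand on a toy counts vector (made-up numbers, not data from this study):

```r
# Toy sketch of LogNormalize: raw counts for one cell across 4 genes
raw_counts <- c(5, 0, 12, 3)
scale.factor <- 10000
# divide by the cell's total counts, multiply by scale.factor, then log1p
normalized <- log1p(raw_counts / sum(raw_counts) * scale.factor)
round(normalized, 2)  # ≈ 7.82 0.00 8.70 7.31
```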

allSO.list <- future_lapply(allSO.list, NormalizeData, verbose = FALSE, future.seed = 26)

The normalization is computed from the raw counts stored in the slot named counts. The result is then stored in another slot named data.

data.frame(GetAssayData(allSO.list$ref_day_04, slot = "counts")[1:3, 1:5]) %>%
    knitr::kable(caption = "Raw counts in 'counts' slot") %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width = "800px", height = "200px")
data.frame(GetAssayData(allSO.list$ref_day_04, slot = "data")[1:3, 1:5]) %>%
    knitr::kable(caption = "Normalized counts in 'data' slot") %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width = "800px", height = "200px")
Raw counts in ‘counts’ slot
ref_day_04_0_AAACCCATCGACTCCT ref_day_04_0_AAACGAAGTCATAACC ref_day_04_0_AAACGCTCAAGCGAGT ref_day_04_0_AAACGCTCACCCTAAA ref_day_04_0_AAAGAACAGCCTTGAT
Xkr4 0 0 0 0 0
Rp1 0 0 0 0 0
Sox17 5 0 0 0 0
Normalized counts in ‘data’ slot
ref_day_04_0_AAACCCATCGACTCCT ref_day_04_0_AAACGAAGTCATAACC ref_day_04_0_AAACGCTCAAGCGAGT ref_day_04_0_AAACGCTCACCCTAAA ref_day_04_0_AAAGAACAGCCTTGAT
Xkr4 0.0000000 0 0 0 0
Rp1 0.0000000 0 0 0 0
Sox17 0.6628109 0 0 0 0

I continue with the standard preprocessing steps, using default parameters:

  • highly variable features (HVG) detection, with the function FindVariableFeatures , where 2000 features are identified as having the most cell-to-cell variation;

  • scale the data, via a linear transformation, using the ScaleData function, on all features (more details below);

  • perform linear reduction dimension with Principal Component Analysis (PCA) to identify the sources of heterogeneity in every sub-dataset, through 50 principal components (PCs);

  • plot an ElbowPlot to visualize the most informative PCs.

allSO.list <- future_lapply(allSO.list, function(SO) {
    SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
    SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
    SO <- RunPCA(SO, verbose = FALSE)
    return(SO)
}, future.seed = 26)

The function ScaleData performs a linear transformation (so-called scaling). In detail, it shifts the expression of each gene so that its mean across cells is 0 (negative values arise), and scales it so that its variance equals 1. Concretely, it standardizes the expression of each gene. This step gives each gene equal weight in downstream analyses, so that highly-expressed genes do not dominate over the others (<https://satijalab.org/seurat/articles/pbmc3k_tutorial.html>).
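The per-gene standardization can be sketched on a toy expression vector (made-up numbers, not data from this study):

```r
# Toy sketch of per-gene scaling: one gene's normalized expression across 5 cells
expr <- c(0.66, 0, 0, 1.2, 0.3)
# shift to mean 0, scale to unit variance (a z-score)
scaled <- (expr - mean(expr)) / sd(expr)
mean(scaled)  # ≈ 0
sd(scaled)    # 1
```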

From the scaling step, a new slot named scale.data is added to the Seurat object (a slot is a component of the Seurat object data structure). Below is an example showing the same features in the same cells as above.

data.frame(GetAssayData(allSO.list$ref_day_04, slot = "scale.data")[1:3, 1:5]) %>%
    knitr::kable(caption = "Scaled data in 'scale.data' slot") %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    scroll_box(width = "800px", height = "200px")
Scaled data in ‘scale.data’ slot
ref_day_04_0_AAACCCATCGACTCCT ref_day_04_0_AAACGAAGTCATAACC ref_day_04_0_AAACGCTCAAGCGAGT ref_day_04_0_AAACGCTCACCCTAAA ref_day_04_0_AAAGAACAGCCTTGAT
Xkr4 -0.2033839 -0.2033839 -0.2033839 -0.2033839 -0.2033839
Rp1 -0.0341485 -0.0341485 -0.0341485 -0.0341485 -0.0341485
Sox17 1.7263776 -0.2826476 -0.2826476 -0.2826476 -0.2826476

The PCA was performed with the identified HVG as input. The Elbow plots below show, for each sub-dataset, how informative each PC is in terms of the percentage of variance explained.

In the analysis done by Rossi et al., the PCA computed only 30 PCs, and all 30 PCs were used in the downstream analysis. Here, I use the default parameters of the RunPCA function, which computes 50 PCs.

lapply(allSO.list, function(SO) {
    ElbowPlot(SO, ndims = 50) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 20)) + 
        geom_vline(aes(xintercept = 30), color = "purple", size = 1) + ggtitle(paste0(SO@project.name, "\nElbowPlot"))
})

Taking into account the advice from the Seurat - Guided Clustering Tutorial vignette, the downstream analysis will be done on the first 30 PCs for all the sub-datasets.

Then, I go through the community detection among the cells. The edges are constructed using a k-nearest neighbors (KNN) graph, based on the euclidean distances previously obtained in the PCA space (restricted to the first 30 PCs). The weight of each edge is based on the shared overlap of the local neighborhoods of its two cells, a method called Jaccard similarity. These 2 steps are done with the FindNeighbors function (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).
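The Jaccard weighting of an edge can be illustrated on two toy neighbor sets (hypothetical cell names, for illustration only):

```r
# Jaccard similarity between the KNN neighborhoods of two cells:
# shared neighbors divided by distinct neighbors
nn_cell1 <- c("c2", "c3", "c4", "c5")
nn_cell2 <- c("c3", "c4", "c6", "c7")
jaccard <- length(intersect(nn_cell1, nn_cell2)) / length(union(nn_cell1, nn_cell2))
jaccard  # 2 shared / 6 distinct neighbors ≈ 0.33
```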

Once I have obtained a graph on the data, I run the FindClusters function. It applies the Louvain algorithm (by default). This algorithm optimizes the modularity of the graph.

  • Modularity assesses the quality of a division of a network (or graph) into communities. The optimization also depends on a resolution parameter: the higher the resolution, the larger the number of clusters.

Originally, the analysis done by Rossi et al. set the resolution parameter at 4.6. Here, I test the Louvain algorithm according to multiple levels of resolution (0.1, 0.5, 1, 2 and 5).

res <- c(0.1, 0.5, 1, 2, 5)
allSO.list <- future_lapply(allSO.list, function(SO) {
    SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
    SO <- FindClusters(SO, resolution = res, random.seed = 11, verbose = FALSE)
    return(SO)
}, future.seed = 26)

The relationship between the resolution level and the number of clusters is presented below.

data <- c()
resolution <- c()
nb_clusters <- c()
for (SO in allSO.list) {
    for (ares in res) {
        nb_levels <- nlevels(SO@meta.data[[paste0("RNA_snn_res.", ares)]])

        data <- append(data, SO@project.name)
        resolution <- append(resolution, ares)
        nb_clusters <- append(nb_clusters, nb_levels)
    }
}

df <- data.frame(data, resolution, nb_clusters)
df %>%
    pivot_wider(names_from = "data", values_from = "nb_clusters") %>%
    knitr::kable(caption = "Number of clusters in sub-datasets\naccording to the resolution level") %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
Number of clusters in sub-datasets according to the resolution level
resolution ref_day_04 ref_day_05 ref_day_06 ref_day_07 lab_day_04 lab_day_05 lab_day_05bis lab_day_06 lab_day_11
0.1 5 5 6 7 3 4 4 5 10
0.5 10 10 11 14 9 8 8 14 15
1.0 13 14 17 18 15 15 12 18 24
2.0 25 23 24 27 24 22 22 29 32
5.0 44 43 44 46 47 46 44 48 48

By running a non-linear dimension reduction such as the Uniform Manifold Approximation and Projection (UMAP), the clusters can be visualized.

allSO.list <- lapply(allSO.list, RunUMAP, dims = 1:30, verbose = FALSE)

for (ares in res) {
    print(DimPlot(allSO.list$lab_day_11, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") + 
        ggtitle(paste0(allSO.list$lab_day_11@project.name, "\nResolution level: ", ares)) + theme(plot.title = element_text(hjust = 0.5)))
}

As an example, for the sub-dataset day_11 of the MMG lab, the best value for the resolution parameter appears to be 1. I set the Idents (the default cell identities of a Seurat object, used for instance to label the clusters in the plots below) of all sub-datasets to the clusters defined at resolution level 1.

allSO.list <- lapply(allSO.list, function(SO) {
    Idents(SO) <- SO$RNA_snn_res.1
    print(DimPlot(SO, label = TRUE, reduction = "umap") + ggtitle(paste0(SO@project.name, "\nResolution level: 1")) + 
        theme(plot.title = element_text(hjust = 0.5)))
    return(SO)
})
gc()

9 Remove doublets

In single-cell RNA sequencing (scRNAseq), the cells in suspension are captured in droplets together with barcoded microparticles (Macosko et al. Cell 2015). In most cases, a droplet contains one microparticle and one cell, but sometimes two or more cells are captured within the same droplet, yielding a doublet or multiplet.

Doublets are technical artifacts that have to be filtered out. In this part, I use the R package DoubletFinder (McGinnis CS et al. Cell Syst 2019).

This tool requires the following steps to be completed:

  • estimation of the number of heterotypic doublets (which relies on the cell annotation and on the homotypic doublet proportion);

  • pK parameter optimization.

9.1 Heterotypic doublet estimation

Two types of doublets have been identified:

  • homotypic doublets, that derive from transcriptionally similar cells,

  • heterotypic doublets, that derived from transcriptionally distinct cells.

Indeed, DoubletFinder is able to detect heterotypic doublets, but not homotypic ones (from the DoubletFinder GitHub repository).

To properly identify the heterotypic doublets, it is necessary to rely on cell annotation.

9.1.1 Cell annotation

The cells need to be annotated for the identification of heterotypic doublets. I hence annotate the cells of the MMG lab dataset in the same way as in the Rossi et al. pipeline: I directly load the scmap cluster index (referred to as the "classifier" in the rest of this document) provided by Rossi et al. for 1165 gene markers. This classifier was built by Rossi et al. from the Pijuan-Sala et al. atlas data.

scmap_classifier <- readRDS(file = paste0(classifier.folder, "scmap_classifier_1000markers.rds"))
ref_stages <- c("E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5")

I use the scmap package from Kiselev, V., Yiu, A. & Hemberg, M., Nat Methods 2018.

allSO.list <- lapply(allSO.list, function(SO) {
    set.seed(1234567)

    # create the corresponding SingleCellExperiment object
    sce <- as.SingleCellExperiment(x = SO)
    rowData(sce)$feature_symbol <- rownames(sce)
    counts(sce) <- as.matrix(counts(sce))
    logcounts(sce) <- as.matrix(logcounts(sce))

    # apply scmapCluster
    scmapCluster_results <- scmapCluster(projection = sce, index_list = scmap_classifier[ref_stages], threshold = 0)

    # add the celltype from the in vivo atlas to the Seurat object
    SO <- AddMetaData(object = SO, metadata = scmapCluster_results$combined_labs, col.name = "celltype_DF")

    # save memory, do garbage collection
    rm(sce)
    gc()

    return(SO)
})

lapply(allSO.list, function(SO) {
    Idents(SO) <- SO$celltype_DF
    DimPlot(SO, pt.size = 0.8, label = T, label.size = 6, repel = T, cols = colors.use_transferred[levels(Idents(SO))]) + 
        theme_void() + NoLegend() + ggtitle(SO@project.name) + theme(plot.title = element_text(hjust = 0.5, 
        size = 20))
})

9.1.2 Homotypic doublet proportion estimation

As mentioned earlier, DoubletFinder only detects heterotypic doublets. I therefore first estimate the proportion of homotypic doublets: scaling the expected number of doublets by the heterotypic fraction prevents DoubletFinder from overestimating the number of doublets to remove.

homo_prop <- sapply(allSO.list, function(SO) {
    # based on annotation (celltype from pijuansala classifier) could be also the louvain clusters determined
    # from the preprocessus step
    annotations <- SO@meta.data$celltype_DF
    homotypic.prop <- modelHomotypic(annotations)

    return(homotypic.prop)
})
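As a sketch of what modelHomotypic computes (to my understanding, the sum of squared annotation frequencies, i.e. the probability that two randomly drawn cells carry the same label; the labels below are made up):

```r
# Toy annotation vector with 3 cell types
annotations <- c("a", "a", "a", "b", "b", "c")
props <- as.numeric(table(annotations)) / length(annotations)
# homotypic doublet proportion estimate
sum(props^2)  # (3/6)^2 + (2/6)^2 + (1/6)^2 ≈ 0.39
```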

Now, I have an estimated homotypic doublet proportion for each sub-dataset.

Homotypic doublets proportion
ref_day_04 0.1828770
ref_day_05 0.1280660
ref_day_06 0.2150622
ref_day_07 0.1463046
lab_day_04 0.1394656
lab_day_05 0.1995816
lab_day_05bis 0.1997156
lab_day_06 0.2937258
lab_day_11 0.1429649

9.1.3 Heterotypic estimated doublets

The percentage of doublets depends on the number of cells loaded into the device before sequencing, and on the prevalence of cell aggregates within the cell suspension.

In the Rossi et al. analysis, this multiplet rate, taken from the 10x Genomics user guide, was estimated at 7.6%. Here, according to the user guide relative to our sequencing analysis, and given that around 20,000 cells were loaded for each experiment, I use a rate of 8.0% on the MMG lab data.

So, I apply the multiplet percentage to the number of cells in each sub-dataset. Then, to estimate the number of heterotypic doublets, I multiply the result by the heterotypic proportion (1 - homotypic doublet proportion).
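As a worked example with the numbers above for ref_day_04 (6726 cells after QC filtering, 7.6% multiplet rate, homotypic proportion 0.1828770):

```r
n_cells <- 6726                # ref_day_04 after QC filtering
multiplet_rate <- 7.6 / 100    # rate used for the reference data
homotypic_prop <- 0.1828770    # from modelHomotypic above
n_doublets <- round(n_cells * multiplet_rate)              # 511 expected doublets
n_heterotypic <- round(n_doublets * (1 - homotypic_prop))  # 418 heterotypic doublets
```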

dblts_nonhomo <- sapply(1:length(allSO.list), function(i) {
    # based on annotation (celltype from pijuansala classifier) could be also the louvain clusters determined
    # from the preprocessus step
    SO <- allSO.list[[i]]
    homotypic.prop <- homo_prop[[i]]

    if (any(grepl("ref", SO@project.name))) {
        nDoublets <- round(ncol(SO) * 7.6/100)
    } else {
        nDoublets <- round(ncol(SO) * 8/100)
    }

    nDoublets_nonhomo <- round(nDoublets * (1 - homotypic.prop))

    return(nDoublets_nonhomo)
})

names(dblts_nonhomo) <- SO.names

Below is the estimated number of heterotypic doublets to remove.

Heterotypic estimated doublets
ref_day_04 418
ref_day_05 506
ref_day_06 485
ref_day_07 479
lab_day_04 379
lab_day_05 431
lab_day_05bis 428
lab_day_06 294
lab_day_11 439

9.2 Parameters’ selection

DoubletFinder relies on 3 main parameters:

  • pN, the number of generated artificial doublets, expressed as a proportion of the merged real-artificial data - by default = 0.25;

  • pK, the PC neighborhood size used to compute the proportion of artificial nearest neighbors (pANN) - to be determined;

  • nExp, the number of predicted doublets, obtained from the previous step.

First, I use the default value of the pN parameter (i.e. 0.25), as the DoubletFinder article showed that its value does not influence the performance of the tool.
Second, I previously determined the number of heterotypic doublets to remove for each sub-dataset (see the table above).
Finally, I only have to determine the value of the pK parameter, using the ground-truth-agnostic parameter selection strategy (i.e. BCMVN maximization).

pK.list <- sapply(allSO.list,
                  function(SO){
                    sweep.res.list_SO <- paramSweep_v3(SO, PCs = 1:30) # as estimated from PC elbowPlot
                    sweep.stats_SO <- summarizeSweep(sweep.res.list_SO, GT = FALSE)
                    bcmvn_SO <- find.pK(sweep.stats_SO)
                    
                    ggplot(bcmvn_SO, aes(pK, BCmetric, group = 1)) +
                      geom_point() +
                      geom_line()
                    
                    pK <- bcmvn_SO %>% # select the pK that corresponds to max bcmvn to optimize doublet detection
                      filter(BCmetric == max(BCmetric)) %>%
                      select(pK) 
                    pK <- as.numeric(as.character(pK[[1]]))
                    return(pK)
                  })

names(pK.list) <- SO.names

Below is the selected value of the pK parameter for each sub-dataset.

pK parameter values
ref_day_04 0.18
ref_day_05 0.16
ref_day_06 0.13
ref_day_07 0.30
lab_day_04 0.18
lab_day_05 0.25
lab_day_05bis 0.29
lab_day_06 0.19
lab_day_11 0.24

9.3 Doublets identification - DoubletFinder running

With all parameters set, I can run DoubletFinder on the sub-datasets.

allSO.list <- lapply(allSO.list, function(SO) {
    SO <- doubletFinder_v3(SO, PCs = 1:30, pN = 0.25, pK = pK.list[[SO@project.name]], nExp = dblts_nonhomo[[SO@project.name]])
})

Running DoubletFinder on the sub-datasets adds a new metadata column to the Seurat objects, named DF.classifications, in which every cell is tagged either Singlet or Doublet.

The number of cells tagged Singlet or Doublet in each sub-dataset can be displayed as follows:

sapply(allSO.list, function(SO) {
    col_dblts <- grep("DF.classifications", colnames(SO@meta.data), value = TRUE)
    table(SO@meta.data[[col_dblts]])
})

Finally, I just have to subset the sub-datasets on the Singlet tagged cells.

allSO.list <- lapply(allSO.list, function(SO) {
    # remove the cells identified as doublets of each of the sub-datasets
    col_dblts <- grep("DF.classifications", colnames(SO@meta.data), value = TRUE)
    Idents(SO) <- col_dblts
    SO <- subset(SO, idents = "Singlet")

    # remove useless columns
    SO@meta.data[[grep("DF.classifications", colnames(SO@meta.data))]] <- NULL
    names(SO@meta.data)[names(SO@meta.data) == grep("pANN", colnames(SO@meta.data), value = T)] <- "pANN"

    return(SO)
})

Size of all the sub-datasets after doublet removal:

size_suball3 <- data.frame(cbind(
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
    }),
    sapply(allSO.list, function(SO){
        if (str_detect(SO@project.name, "04")) { "4" }
        else if (str_detect(SO@project.name, "05bis")) { "5bis" }
        else if (str_detect(SO@project.name, "05")) { "5" }
        else if (str_detect(SO@project.name, "06")) { "6" }
        else if (str_detect(SO@project.name, "07")) { "7" }
        else if (str_detect(SO@project.name, "11")) { "11" }
    }),
    
    t(sapply(allSO.list, dim)),
    "Removed Doublets",
    "03"
))
colnames(size_suball3) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball3$Nbr_of_cells <- as.integer(as.character(size_suball3$Nbr_of_cells))
size_suball3$Nbr_of_features <- as.integer(as.character(size_suball3$Nbr_of_features))
size_suball3$day <- factor(size_suball3$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball_track, size_suball3)
knitr::kable(size_suball_track)
rm(size_suball3)
origin day Nbr_of_features Nbr_of_cells Analysis_step_name Step_number
ref_day_04 ref 4 23961 6898 Preliminary Counts 01
ref_day_05 ref 5 23961 7783 Preliminary Counts 01
ref_day_06 ref 6 23961 8313 Preliminary Counts 01
ref_day_07 ref 7 23961 7502 Preliminary Counts 01
lab_day_04 lab 4 32286 7324 Preliminary Counts 01
lab_day_05 lab 5 32286 7794 Preliminary Counts 01
lab_day_05bis lab 5bis 32286 7729 Preliminary Counts 01
lab_day_06 lab 6 32286 6628 Preliminary Counts 01
lab_day_11 lab 11 32287 9584 Preliminary Counts 01
ref_day_041 ref 4 17928 6726 QC_filtering 02
ref_day_051 ref 5 18282 7637 QC_filtering 02
ref_day_061 ref 6 18247 8133 QC_filtering 02
ref_day_071 ref 7 18279 7385 QC_filtering 02
lab_day_041 lab 4 17679 5518 QC_filtering 02
lab_day_051 lab 5 18786 6728 QC_filtering 02
lab_day_05bis1 lab 5bis 18582 6689 QC_filtering 02
lab_day_061 lab 6 17927 5196 QC_filtering 02
lab_day_111 lab 11 18354 6398 QC_filtering 02
ref_day_042 ref 4 17928 6308 Removed Doublets 03
ref_day_052 ref 5 18282 7131 Removed Doublets 03
ref_day_062 ref 6 18247 7648 Removed Doublets 03
ref_day_072 ref 7 18279 6906 Removed Doublets 03
lab_day_042 lab 4 17679 5139 Removed Doublets 03
lab_day_052 lab 5 18786 6297 Removed Doublets 03
lab_day_05bis2 lab 5bis 18582 6261 Removed Doublets 03
lab_day_062 lab 6 17927 4902 Removed Doublets 03
lab_day_112 lab 11 18354 5959 Removed Doublets 03
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name), 
    position = position_dodge(), width = 0.6) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + 
    ggtitle("Number of cells in each sub-dataset\nafter doublets removal") + facet_grid(. ~ origin)


10 Preprocessing workflow without doublets

Since the doublet removal, like the QC filtering, discards cells, the normalization, variable feature selection, scaling, PCA, clustering and UMAP have to be recomputed on the remaining cells. I hence run again the complete Seurat standard preprocessing workflow, with the same parameters as before (see section 8 for details), except for the resolution levels, now sampled between 0.8 and 1.6.

res <- seq(0.8, 1.6, 0.2)
allSO.list <- lapply(allSO.list, function(SO) {
    SO <- NormalizeData(SO, verbose = FALSE)
    SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
    SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
    SO <- RunPCA(SO, verbose = FALSE)
    ElbowPlot(SO, ndims = 50) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept = 30), 
        color = "purple", size = 1) + ggtitle(paste0("ElbowPlot\n", SO@project.name))
    SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
    SO <- FindClusters(SO, resolution = res, random.seed = 11, verbose = FALSE)
    SO <- RunUMAP(SO, dims = 1:30)
    for (ares in res) {
        print(DimPlot(SO, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") + ggtitle(paste0("Resolution level: ", 
            ares, "\n", SO@project.name)) + theme(plot.title = element_text(hjust = 0.5)))
    }
    return(SO)
})

11 Sources of unwanted variation

There are several sources of unwanted variation in a single-cell RNA sequencing analysis. Unwanted variation may come from a batch effect, but also from the cell-cycle state of the cells, or from the mitochondrial and ribosomal expression levels. I will check the batch effect on the reference data and on the day-5 lab data, the only lab time point sequenced in two batches. Then, I will check the cell-cycle effect on all the sub-datasets.
Regarding the mitochondrial expression level, I will not study it further, given that I already applied thresholds on this variable to eliminate low-quality cells.

11.1 Batch effect

As already mentioned, the MMG lab dataset has only one replicate for the sequencing data of day 04, day 06 and day 11. Day 05 has been resequenced, and hence is the only day with two replicates. The two sub-datasets (lab_day_05 and lab_day_05bis) are similar enough to be considered as two batches of a technical replicate.

11.1.1 MMG Lab data on day 5

# Get the two day-5 sub-datasets from the MMG lab
lab_datasets <- c("lab_day_05", "lab_day_05bis")
mergeDay5 <- merge(x = allSO.list$lab_day_05, y = allSO.list$lab_day_05bis, project = "lab_day_05", merge.data = T)
mergeDay5@meta.data[["dataset"]] <- "lab_day_05"
Idents(mergeDay5) <- mergeDay5$dataset
mergeDay5 <- NormalizeData(object = mergeDay5, normalization.method = "LogNormalize", scale.factor = 10000, 
    verbose = FALSE)
mergeDay5 <- FindVariableFeatures(object = mergeDay5, nfeatures = 1000, verbose = FALSE)
mergeDay5 <- ScaleData(object = mergeDay5, verbose = FALSE)
mergeDay5 <- RunPCA(mergeDay5, npcs = 30, verbose = FALSE)

# function to do DimPlot entitled with sub-dataset name and if PCA is done before or after the regression
batch_DimPlot <- function(SO, when = NA) {

    Idents(SO) <- SO$replicate
    DimPlot(SO) + ggtitle(paste0("Replicates of ", SO@project.name, "\nin the 2 first PCs ", when, " regression"))
}

batch_DimPlot(mergeDay5, when = "before")

The two batches overlap throughout the space of the first two PCs.

Below, the batch correction is performed. This step allows us to determine whether the batch correction is required or not.

correctedDay5 <- ScaleData(mergeDay5, vars.to.regress = "replicate", verbose = FALSE)
correctedDay5 <- RunPCA(correctedDay5)
batch_DimPlot(correctedDay5, when = "after")

After batch correction, the data do not provide more information in the first two PCs. The batch correction will therefore not be kept for the rest of the analysis; the replicates of day 05 will simply be merged.

allSO.list$lab_day_05 <- mergeDay5
allSO.list$lab_day_05bis <- NULL
SO.names <- names(allSO.list)
rm(mergeDay5, correctedDay5)
gc()

11.1.2 Reference data

The reference sub-datasets, which each pool several replicates, are tested for a batch effect in the same way.

# Get only reference sub-datasets
ref_datasets <- c("ref_day_04", "ref_day_05", "ref_day_06", "ref_day_07")
ref_SO.list <- allSO.list[ref_datasets]

# function to do DimPlot entitled with sub-dataset name and if PCA is done before or after the regression
batch_DimPlot <- function(SO, when = NA) {

    Idents(SO) <- SO$replicate
    DimPlot(SO) + ggtitle(paste0("Replicates of ", SO@project.name, "\nin the 2 first PCs ", when, " regression"))
}

future_lapply(ref_SO.list, batch_DimPlot, when = "before", future.seed = 26)

The variation throughout the data does not appear to be batch-driven. Let's see what happens after regressing out the batch variable.

ref_SO.list <- future_lapply(ref_SO.list, ScaleData, vars.to.regress = "replicate", verbose = FALSE, future.seed = 26)
ref_SO.list <- future_lapply(ref_SO.list, function(x) RunPCA(x, features = VariableFeatures(x)), future.seed = 26)
future_lapply(ref_SO.list, batch_DimPlot, when = "after", future.seed = 26)

As observed before the regression, the variation in the reference data is not led by the replicates.

In conclusion, there is no need to apply the regression on the sub-datasets.

11.2 Cell-cycle effect

It is known that the cell-cycle state of a cell influences its pattern of gene expression. This part is largely inspired by the cell_cycle_vignette provided by Seurat.

11.2.1 Retrieve cell-cycle markers

After getting the list of cell-cycle markers, I want to know if they have been detected in the sequencing.

# A list of cell-cycle markers, from Tirosh et al, 2015, is loaded with Seurat
s.genes <- str_to_title(cc.genes$s.genes)
g2m.genes <- str_to_title(cc.genes$g2m.genes)

# Number of markers for the S and G2M phases
sapply(list(S_phase = s.genes, G2M_phase = g2m.genes), length)

# Are cell-cycle markers in my data ?  check for the length of the intersection
sapply(allSO.list, function(x) length(intersect(rownames(x@assays$RNA), s.genes)))
sapply(allSO.list, function(x) length(intersect(rownames(x@assays$RNA), g2m.genes)))
##   S_phase G2M_phase 
##        43        54 
## ref_day_04 ref_day_05 ref_day_06 ref_day_07 lab_day_04 lab_day_05 lab_day_06 
##         42         42         42         42         42         42         42 
## lab_day_11 
##         42 
## ref_day_04 ref_day_05 ref_day_06 ref_day_07 lab_day_04 lab_day_05 lab_day_06 
##         52         52         52         52         52         52         52 
## lab_day_11 
##         52

Most of the markers have been sequenced in each of the sub-datasets.

11.2.2 Determine the cell-cycle phase of the cells

Now, I can assign cell-cycle scores, based on the expression of the G2/M and S phase markers in each cell.

# Apply a cell cycle score
cc_SO.list <- future_lapply(allSO.list, CellCycleScoring, s.features = s.genes, g2m.features = g2m.genes, 
    verbose = FALSE, future.seed = 26)

11.2.3 Plots generation before cell-cycle regression

I create a generic function to plot the data in the first 2 PCs. Each plot title indicates whether the PCA relies on the variable features or on the cell-cycle markers, and whether the PCA has been done before or after the cell-cycle regression.

run_DimPlot <- function(SO, pca = NA, when = NA) {
    Idents(SO) <- SO$Phase
    p <- DimPlot(SO) + ggtitle(paste0("PCA: ", pca, "\nRegression: ", when)) + theme(text = element_text(size = 10), 
        plot.title = element_text(face = "plain"))
    return(p)
}

I first call that function on the existing PCA; the PCA on the variable features has already been done during the Pre-process Seurat objects part. Then, a new PCA is run on the cell-cycle markers, followed by a new call to the function.

beforeAll <- future_lapply(cc_SO.list, run_DimPlot, pca = "variable features", when = "before", future.seed = 26)
names(beforeAll) <- SO.names

cc_SO.list <- future_lapply(cc_SO.list, RunPCA, features = c(s.genes, g2m.genes), verbose = FALSE, future.seed = 26)

beforeCCgenes <- future_lapply(cc_SO.list, run_DimPlot, pca = "cell-cycle markers", when = "before", future.seed = 26)
names(beforeCCgenes) <- SO.names

The plots are stored in lists, under the names of the sub-datasets.

11.2.4 Run the cell-cycle regression

# Regression
cc_SO.list <- future_lapply(cc_SO.list, function(x) ScaleData(x, vars.to.regress = c("S.Score", "G2M.Score"), 
    verbose = FALSE), future.seed = 26)

11.2.5 Plots generation after cell-cycle regression

# Plots generation
cc_SO.list <- future_lapply(cc_SO.list, RunPCA, verbose = FALSE, future.seed = 26)
afterAll <- future_lapply(cc_SO.list, run_DimPlot, pca = "variable features", when = "after", future.seed = 26)
names(afterAll) <- SO.names

cc_SO.list <- future_lapply(cc_SO.list, RunPCA, features = c(s.genes, g2m.genes), verbose = FALSE, future.seed = 26)
afterCCgenes <- future_lapply(cc_SO.list, run_DimPlot, pca = "cell-cycle markers", when = "after", future.seed = 26)
names(afterCCgenes) <- SO.names

11.2.6 Plots display

displayPlot <- function(i) {
    p1 <- beforeCCgenes[[i]]
    p2 <- beforeAll[[i]]
    p3 <- afterCCgenes[[i]]
    p4 <- afterAll[[i]]

    grid.arrange(p1, p2, p3, p4, nrow = 2, top = textGrob(SO.names[i], hjust = 0.5, gp = gpar(fontsize = 18, 
        fontface = "bold")))

}

lapply(1:length(cc_SO.list), displayPlot)

As the PCA is run on the s.genes and g2m.genes lists, it unsurprisingly reveals that the cells segregate entirely according to their cell-cycle phase.

12 Conclusion

In this first protocol, I organized the datasets into sub-datasets according to their origin, reference or lab, and to the day of the experiment on which they were obtained. Then, I cleaned the sub-datasets by filtering the cells on their mitochondrial expression, their number of counts and their number of detected features, removing the outliers. I also removed the doublets. Finally, I tested whether or not it is necessary to regress the sub-datasets on the batch effect or the cell-cycle phase.

All these sub-datasets have to be saved, to be reused in the second protocol of the study.

future_lapply(allSO.list, function(x) saveRDS(x, file = paste0(rdsObjects, "01_dailySO_", x@project.name, 
    ".rds")), future.seed = 26)